18:58
2026-06-24
lesswrong.com
ai-safety
Reward Hacking Without Egregious Misalignment in an RL-Only Setting
Researchers trained Kimi K2.5 and GPT-OSS 120b on reward-hackable coding environments and found that the models reliably learned to reward hack and generalized this behavior to novel environments. Unlike pr…